SARIMA and ARDL models for predicting leptospirosis in Anuradhapura district Sri Lanka

Leptospirosis is considered a neglected tropical disease despite its considerable mortality and morbidity. Lack of prediction remains a major reason for underestimating the disease. Although many models have been developed, most of them focused on the districts situated in the wet zone due to higher case numbers in that region. However, leptospirosis remains a major disease even in the dry zone of Sri Lanka. The objective of this study is to develop a time series model to predict leptospirosis in the Anuradhapura district situated in the dry zone of Sri Lanka. Time series data on monthly leptospirosis incidences from January 2008 to December 2018 and monthly rainfall, rainy days, temperature, and relative humidity were considered in model fitting. The first 72 months (55%) were used to fit the model, and the subsequent 60 months(45%) were used to validate the model. The log-transformed dependent variable was employed for fitting the Univariate seasonal ARIMA model. Based on the stationarity of the mean of the five variables, the ARDL model was selected as the multivariate time series technique. Residuals analysis was performed on normality, heteroskedasticity, and serial correlation to validate the model. The lowest AIC and MAPE were used to select the best model. Univariate models could not be fitted without adjusting the outliers. Adjusting seasonal outliers yielded better results than the models without adjustments. Best fitted Univariate model was ARIMA(1,0,0)(0,1,1)12,(AIC-1.08, MAPE-19.8). Best fitted ARDL model was ARDL(1, 3, 2, 1, 0),(AIC-2.04,MAPE-30.4). The number of patients reported in the previous month, rainfall, rainy days, and temperature showed a positive association, while relative humidity was negatively associated with leptospirosis. Multivariate models fitted better than univariate models for the original data. Best-fitted models indicate the necessity of including other explanatory variables such as patient, host, and epidemiological factors to yield better results.


Introduction
Leptospirosis is a main tropical disease affecting humans as well as animals [1]. It is caused by a spirochete (a type of spiral-shaped bacteria) of the genus Leptospira of the family Leptospiraceae [2]. Leptospira is excreted to the environment via the urine of infected or carrier animals and enters the human body through a skin breach or a mucous membrane [2]. The severity of leptospirosis varies from mild self-resolving fever to severe disease with the involvement of major organs of the body [3]. These severe cases are considered in disease burden estimations, while mild cases are often neglected from these estimations in most settings [4]. The annual global burden of leptospirosis includes nearly one million cases, 60,000 deaths, and 2.9 million disability-adjusted life years [3]. However, these estimates are considered underestimations due to various reasons, including under notification, deficiencies of diagnostics, diversity of Leptospira, and inaccurate predictions [5][6][7]. Among the reasons for underestimation mentioned above, the major focus of this paper is to describe the gaps associated with disease predictions and propose prediction models for the Anuradhapura district of Sri Lanka.
Global disease estimations named Sri Lanka a hotspot of leptospirosis [8,9]. The estimated annual incidence of leptospirosis is 52.2 per 100,000, with estimated annual deaths of 760 [4]. Nevertheless, these cases and deaths have not been distributed evenly across all parts of the country [4,10]. Sri Lanka has three main climate zones: wet, dry, and intermediate zones [5]. The wet zone receives higher average annual rainfall than the other two zones and has higher humidity. At the same time, solar radiation and temperature are higher in the dry zone [5]. Leptospirosis outbreaks are usually associated with events due to heavy rainfalls such as floods or water-related recreational activities [11,12]. Also, leptospirosis shows associations with the other meteorological parameters such as rainy days per month, solar radiation, relative humidity, and environmental temperature [5,[13][14][15][16]. Since both meteorological parameters and leptospirosis cases are changing across different geographical locations, the association of leptospirosis with meteorological parameters could be different in different regions [5].
Many prediction models have been created to predict leptospirosis in Sri Lanka [5,17,18]. However, almost all studies have focused on a single or few districts for the modeling. Focusing on a single district may not represent the other districts, especially the ones situated in other climate zones. Also, most of the models developed were univariate, while multivariate techniques have not been adequately used for modeling in Sri Lanka [13]. further, only a few papers have targeted the dry zone of Sri Lanka for the prediction models [5,18]. A previous study has demonstrated the differences between the climate zones regarding the prediction models and associations with meteorological parameters of leptospirosis [5]. It has been shown that the dry zone behaves entirely different from the wet zone of Sri Lanka with regard to patient reporting, outbreaks, rainfall, and other meteorological parameters [5]. However, in that study, the authors have not demonstrated the leptospirosis reporting pattern of the individual districts. Demonstrating the models for individual districts is essential as the dry zone extends from the northern part to the southern part of Sri Lanka. The only paper targeting an individual district in the dry zone was Ehelepola et al., published in 2021 [19]. Here, they have demonstrated that the temperature shows an opposite effect in the Kandy district (wet zone) in comparison to the Hambanthota district (dry zone). However, as shown in Fig 1, Hambanthota is not only situated in the dry zone. Therefore, selecting a district situated exclusively in the dry zone is essential to examine the meteorological association of leptospirosis in the dry zone where many predisposing factors for leptospirosis, such as paddy farming, are located.
We have selected the Anuradhapura district for this study to represent only the dry zone of Sri Lanka. Also, the Anuradhapura district has reported leptospirosis incidence of more than 500 per 100,000 population during the recent past [20]. Based on the ongoing research activities, the awareness, diagnosis, and reporting are expected to be high. Unlike in the districts of the wet zone, dry zone districts report episodic leptospirosis cases. Frequent wetness in wet zone facilitates Leptospira causing frequent outbreaks. In contrast, leptospirosis outbreaks in the Anuradhapura district are associated with northeastern monsoon rains from November to January [5]. Anuradhapura reports one of the highest incidents of leptospirosis among the districts in the dry zone [10]. Also, several micro geographical differences related to leptospirosis were noted in the Anuradhapura district compared to the Kandy district of Sri Lanka [12]. In addition, different types of Leptospira have been isolated from the dry zone compared to the wet zone [21]. Therefore, the Anuradhapura district is an ideal geographical setting to study the variations of leptospirosis and also to determine meteorological associations of leptospirosis in the dry zone of Sri Lanka. This study aims to develop mathematical models to predict leptospirosis and determine meteorological associations of leptospirosis in the Anuradhapura district of Sri Lanka.

Study setting
The setting of this study was the Anuradhapura district of Sri Lanka. (Fig 1) It is situated in the dry zone and is considered a geographical area with typical features of the dry zone. Annual rainfall is less than 1750mm. The main rainy season is from November to January and is associated with northeastern monsoon rains [5,22]. During this period, a higher number of leptospirosis cases are being reported. The majority of the population are paddy farmers who engage in paddy cultivation during the monsoon rains [23]. Most of the common socio-demographic risk factors are prevalent in this area [24]. Many papers have been published related to leptospirosis in Anuradhapura, Sri Lanka; nevertheless, none of the studies focused on time series analysis to forecast the future incidence of the district [12,[25][26][27].

Data sources
Leptospirosis data are publicly available on the official webpage of the Epidemiology Unit of Sri Lanka [28]. Therefore, all the data published in the weekly epidemiological report were compiled as monthly data. In addition, data on monthly rainfall, rainy days, relative humidity, and environmental temperature were purchased from the meteorological department of Sri Lanka [22]. The period for the obtained data was from January 2008 to December 2018.

Data analysis
The data analysis was done using Eviews version 10. Fig 2 summarises the analytical framework of data analysis.

Testing for stationary and seasonality
The stationary mean was confirmed using the unit root test by the Augmented Dickey-Fuller test. (Table 1) In the unit root test, the stylized trend-cycle decomposition of a time series y t is given by where TD t is a deterministic linear trend, and z t is the AR(1) process.
The seasonality of the data series was tested using the traditional Hylleberg, Engle, Granger, and Yoo(HEGY) process. (Table 1).

Data transformation and differencing
The monthly leptospirosis incidence distribution had non-stationary variance and stationery mean while showing a significant seasonality effect. Natural log transformation was done to achieve stationary variance. The original value of '0' is identified as a missing value during the log transformation, and the value '1' becomes 0. Linear interpolation using Eviews was performed to avoid the effect on original values of 0 and 1 during log transformation. Regular differencing was not performed as the mean was stationary. Instead, seasonal differencing was performed to remove the seasonality of data. In multivariate analysis, all regressors were Z transformed to avoid the measuring unit effect of the variables.

Outlier adjustment
Outlier adjustment was performed only for univariate analysis as residuals did not follow normal distribution without adjustment for seasonal outliers. Z scores were obtained for the seasonality-adjusted data. Z scores of more than 1.96 were adjusted using the linear interpolating function of Eviews. In the linear interpolation method, linear approximation was made based on the previous and the next non-missing value of the series as an equation and the interpolated value(Iv) is given by, where λ is the relative position of the missing value, P (i-1) is the immediately previous value, and P (i+1) is the immediate next missing value.

Model fitting
The first 72 time points (January 2008 to December 2013) were used to create the models, and the following 60-time points(January 2014 to December 2018) were used to validate them in both univariate and multivariate modeling.

Univariate analysis
The autocorrelation function (ACF) and Partial autocorrelation function (PACF) of the variable of monthly leptospirosis patients were plotted. Based on the significant lags of ACF and PACF, the Seasonal ARIMA model was fitted. The model designation ARIMA (p, d, q) (P, D, Q) s consists of regular autoregressive (AR-p) and moving average (MA-q) terms, which account for correlations with low lags. In addition, seasonal AR(P) and seasonal MA(Q) terms account for correlations with seasonal lags. For this study, seasonality was analyzed over a time frame of months, i.e., there were 12 time periods in one season(Seasonality for the model is 12 months). The terms' D' and 'd' indicate the number of seasonal differencing and regular differencing, respectively. They were used to make the mean of the data series stationary. The term 's' indicates the seasonality. The SARIMA model specification is (1), e t = error term.

Multivariate analysis
The autoregressive distributed lag (ARDL) was selected as the method to create a multivariate model. ARDL technique can be used for variables with a mixed order of integration. Irrespective of the difference order-I(0) or I(1) or a combination-ARDL models can be applied. Also, all underlying variables stand as a single equation in ARDL models [29]. In the data set, the variables of rainy days and relative humidity achieved a stationary mean in the first differ-ence{I(1)}. The rest of the three variables, including the dependent variable, were stationary at the level{I(0)} (Tables 1 & 4). The ARDL (p,q 1 ,q 2 ,q 3 ,q 4 ) model specification is where where p = number of patients; q 1 , q 2 , q 3 , and q 4 represent rainfall, rainy days, relative humidity, and temperature, respectively; L = the lag operator and w t = the vector of deterministic variables.

Model validation
Serial autocorrelation of the SARIMA model was detected using Durbin Watson (DW) Statistic. DW around 2 is considered as no autocorrelation. To identify serial autocorrelation of the best-[fitting]multivariate model, the Lagrange multiplier (LM) test was applied. The LM test is the product of the coefficient of determination (R 2 ) from auxiliary regression and the sample size, LM = nR 2 , where R 2 = coefficient of determination and n = number of time points. A non-significant (i.e., P> 0.05) value for the LM test indicates that the residuals of the time series model are not serially autocorrelated, and the Jarque-Bera(JB) test was used to detect the normality of residuals. The ARCH test was used to determine the heteroskedasticity of the residuals. In addition, the lowest Akaike information criterion (AIC) and the lowest mean absolute percentage error (MAPE) were used to select the best-fitting models. The formulas for calculating JB statistic, MAPE, and AIC are as follows; where S = skewness and K = kurtosis.
where n = number of time points, O t = observed value, y t = forecasted value where K is the number of model parameters.
where e t is the residual figure, and T is the number of observations in the experiment.

Ethics statement
This study includes only secondary data. There is no significant ethical issue. Therefore, ethical clearance was not obtained for this study.

Univariate analysis
Monthly leptospirosis cases in the Anuradhapura district from January 2007 to April 2019 are displayed in Fig 3A. Seasonally is a predominant feature in the distribution. Kernel density indicates that most of the values are within a certain limit, and therefore, the distribution peaks can be considered as positive outliers. However, these outliers were observed neither seasonally nor in regular intervals. As the variance of the original distribution is not stationary, the variable is transformed into natural logarithm to achieve the variance stationarity and the results are displayed in Fig 3B [30]. Also, seasonality is more evident than in the original distribution after transformation. Nevertheless, the outliers of the distribution were still obvious even after transformation but were less prominent than the original distribution. Next, the Autocorrelation function (ACF) and the partial autocorrelation function (PACF) were plotted to confirm the seasonality. In addition, a seasonal unit root test was performed. The autocorrelation function and the partial autocorrelation function of the log-transformed variable are shown in S1 File. The first two lags were clearly significant in both ACF and PACF. In addition, lag 14-18, lag 24-25, and 34 to 36 were noticeably significant in ACF, indicating the seasonality of the distribution. Further, we applied the unit root tests to the original variable to confirm the stationarity of the mean and the seasonality. Table 1 shows the unit root test results for stationary mean and seasonality. According to the unit root test for the stationary mean, the mean is stationary in the series. At the same time, seasonality is significant in the series at a 5% significant level but not at a 1% significance.
As seasonality was significant in the variable, seasonal differencing was performed on the log-transformed variable. Log-transformed, seasonally differenced series was displayed in Fig  3C. Visually, seasonality is less evident in Fig 3C than in Fig 3B. However, outliers have become more prominent in seasonality adjusted series. The kernel density of Fig 3C indicates that more values are scattered at the vertical center while extending vertically towards the outliers. Therefore, the outliers were adjusted and interpolated using the linear function of Eviews software as described in the methodology section, and the adjusted series is displayed in Fig  3D. The kernel density of 3D indicates that the values are distributed evenly in the vertical axis compared to Fig 3C. To confirm that the seasonal differencing was successful, we repeated the seasonal unit root test for the seasonally differenced series. The results are displayed in S2 File. According to this, the test statistic increased from 9.8 to 15.4. Thus, the previous gap between the 5% significant level and the test statistic has been reduced. This indicates that the seasonal differencing has an effect on the variable. However, seasonality is still not significant at 1% significant level.
The ACF and PACF of log-transformed, seasonally differenced and, outlier adjusted series were plotted and displayed in Fig 4. The first lag in PACF and four lags in ACF were significant. However, significant first lag in PACF and exponential decay in ACF indicate AR(1) process. Also, MA(4) is also a distant possibility according to the correlogram. Significant lag 12 of ACF indicates a MA (12) or SMA(1) process. In PACF, lag 12 is not significant, while lag 24 is significant, indicating doubtful results regarding the seasonal AR process. Considering all these possibilities, the following ARMA models were fitted for the data series, and the results are displayed in Table 2.
According to Table 2, the best model to predict monthly leptospirosis cases in the Anuradhapura district was ARIMA(1,0,0)(0,1,1) 12. Seasonality, AR(1) and SMA(1) have been detected as the prominent features through plotting ACF and PACF. In addition, this model has the lowest values of AIC (Akaike information criterion), BIC (Bayesian information criteria), and HQ (Hannan-Quinn information). Further details on the best-fitted model are displayed in Table 3. Durbin Watson test statistic was 1.97, which was closer to, two and it indicates that there is no residual serial correlation. The ARCH test result was not significant, indicating heteroskedasticity was not observed in the residuals after fitting the model. (Table 3) Non-significant results of the Jarque-Bera test indicated that the residual follows a normal distribution. (S3 File) In addition, ACF and PACF were plotted for the residuals of the best-fitted models. (S4 File) As none of the lags were significant in residual correlograms, we can predict that the model has adequately described the observed variability. All other models displayed in Table 3 also described the observed variability satisfactorily; however, ARIMA (1,0,0)(0,1,1) 12 is the parsimonious model with a lesser number of parameters. The model specification of the best fitted model [ARIMA(1,0,0)(0,1,1) 12 ] was;

PLOS ONE
Predicting leptospirosis in Anuradhapura district Sri Lanka  Table 3). Fig 6A-6D show the distribution of monthly rainfall, rainy days, average daytime relative humidity, and the maximum temperature of the Anuradhapura district from January 2008 to December 2018.

Multivariate analysis
The results of the unit root tests of meteorological variables are displayed in Table 4. Accordingly, Rainfall and Temperature were stationary without differencing, while rainy days and relative humidity were stationary at the first difference. Therefore, including the dependent variable, three variables were stationary at I(0), and two variables were stationary at I(1).
The lag length of criteria to determine the best predictive lag length was observed for each independent variable separately and as a whole (S4 and S5 Files). The third lag was considered the best lag for plotting the ARDL model. ARDL models were fitted with a lag length of three, and the best selected ARDL models were displayed in Table 5. Table 6 shows the criteria of the best-selected model. ARDL (1,3,2,0,1) was selected as the best model to describe the observed variability with the lowest AIC value.
According to Table 6, the model has significantly described the observed variability. Patient numbers in the previous month significantly predicted leptospirosis in this month. Also, the average number of rainy days per month showed a positive association with patient numbers in lag 0, 1, and 2. However, only lag 2 and 3 of rainfall showed a positive association with leptospirosis. Relative humidity displayed a negative effect on leptospirosis. The temperature showed a positive association with the occurrence of leptospirosis. Residuals were checked for normality, heteroskedasticity and serial correlation to validate the ARDL model, and the residuals follow a normal distribution. (S6 File) Furthermore,  according to Table 6, the ARCH test and LM test were not significant, indicating that the residuals do not show heteroskedasticity and no serial correlation of the residuals. None of the lags were significant in the residual ACF and PACF, indicating the validity of the fitted model. Fig 7 shows the forecast values for the observed monthly leptospirosis cases by the ARDL model. Table 7 depicts the results of the best-fitted models with and without seasonal or general outliers. The best-fitted models described above are in bold. The effect of outlier adjustment for a model is discussed in detail in the discussion section.

Discussion
Using the Autoregressive distributed lag (ARDL) method, we showed that the leptospirosis case could be predicted using metrological parameters and the previous month case numbers in the dry zones in Sri Lanka. ARDL, VAR, VECM, ARCH, and GARCH are the possible models that can be used to determine the association of these meteorological parameters and to predict leptospirosis. In this analysis, the Autoregressive distributed lag(ARDL) method was selected over other techniques based on the mean stationarity of the five variables. However, ARDL models are not often used for disease predictions. We have shown that the ARDL models can be used for disease predictions as naturally existing variables cannot be expected to be in the same order of integration.
As Abraham et al. [31] proposed, we performed methodological changes in this model by adjusting the outliers after seasonal differencing instead of performing from the beginning. These outliers were observed neither seasonally nor in regular intervals ( Fig 3C). Therefore, these outliers observed during the non-leptospirosis season of the time series can directly influence the lack of fitting of the model. Univariate models cannot capture the influence of these external factors and multivariate models are needed to explain the influence of external factors on the occurrence of leptospirosis. The observed MAPE values are slightly higher than the accepted range in both univariate and multivariate models. This could be due to the non-inclusion of non-meteorological regressors such as human behavior-related factors into the model. As described in Table 7, residual heteroskedasticity and residual serial correlation were not observed in any models. However, the residual normality could only be achieved after seasonal outlier adjustments in the univariate models, while the residuals were normally distributed in multivariate models. As shown in Table 7, the best-fitted model was observed in the seasonality-adjusted series. (Lowest MAPE and AIC values) and a valid model could only be fitted to original data (without outlier adjustments) by multivariate modelling. It was evident that the seasonal outliers are a major reason for the mal-fitting of the models containing seasonal data. Leptospirosis outbreaks generally occur during the rainy season, extending from November to January. Thus, any outbreak during the dry season could be identified as a seasonal outlier. Therefore, adjusting the outliers in the original variable is not the best option in a seasonal data series. Higher case numbers occur

PLOS ONE
during a rainy season may not be an outlier, while even a small peak in the non-rainy season could be an outlier and cause the mal-fitting of the model. The techniques and their order used for outlier adjustments were not discussed adequately in previous publications. The average number of rainy days per month was associated with patient numbers in lag 0, 1, and 2. However, only lag 2 and 3 of rainfall was associated with leptospirosis. The standardized coefficients are also higher in rainy days than in rainfall. This might be due to continuous rain over a period of a few months, creating a favourable environment for the growth and survival of Leptospira. This finding is consistent with our previous work [5]. However, many previous studies suggest that rainfall is vital in predicting leptospirosis [15,32,33]. Most of these studies have not adjusted the collinearity effect of rainfall and rainy days. A study conducted in China showed that limited rainfall is associated with leptospirosis while heavy rainfall is not associated, probably because heavy rainfall can be negatively associated with leptospirosis as the natural habitats of rodents are destroyed [34].
In contrast to the previses published work which showed a lack of association between relative humidity and leptospirosis [5], our prediction model shows a negative association. Several studies suggest that warm, humid conditions are needed to survive L. interrogans outside the host [16,35]. However, some studies suggest that a higher relative humidity has a negative impact on rodents which is the natural host of Leptospira [36]. Also, evidence suggests that infective Leptospira species in different geographical regions could be different and some of the essential requirement for survival is different [21]. These facts may partially explain the observed association. The temperature shows a positive association with the occurrence of leptospirosis. Although this is a contrasting finding compared to previous studies, some suggest that temperature is associated with rodents' survival and rodent densities [35].
As mentioned in the introduction section, most of the time series models were univariate and developed targeting the districts in the wet zone of Sri Lanka. In comparison to the wet zone, accuracy of the univariate models seems less in the dry zone. As mentioned above, this could be attributed to the other factors influencing leptospirosis in the dry zone, while in the wet zone, leptospirosis mostly depends on the meteorological factors [5]. The highest coefficient observed for the constant clearly indicates that there are other factors influencing leptospirosis occurrence. This indicates that at least a few other major parameters are required to create better models to predict leptospirosis [37]. Leptospirosis disease transmission is a complex and dynamic process that involves human behaviours, susceptibility, non-human host behaviours and reproduction, growth and survival of Leptospira in the natural environment and the ecology of the concerning geography. All these factors could not be fitted to a single model. However, as mentioned above, these parameters could be included in the models developed in future studies. Despite the limitations mentioned, these models could effectively predict leptospirosis not only in the Anuradhapura district but also in the other districts of the dry zone with similar meteorological characteristics. In addition, these models could be validated further in other countries which share similar meteorological and sociodemographic characteristics.
In conclusion, we have demonstrated that seasonal outliers are a significant limitation of creating univariate models in relatively dry environments. In contrast, multivariate models had captured the seasonal outliers and showed residual normality, indicating that all the outliers were due to the effect of meteorological parameters. Leptospirosis cases in the previous month, rainfall, rainy days, and temperature had shown a positive association with leptospirosis, while relative humidity indicated a negative association. These models could be effectively utilized in public health policy and planning using the routinely available data. We propose further studies incorporating other contributory factors in addition to meteorological variables for better disease prediction.